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END-TO-END OPTIMIZATION OF HIGH THROUGHPUT 

DNA SEQUENCING 
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Abstract. At the core of high throughput DNA sequencing platforms 
lies a bio-physical surface process that results in a random geometry of 
clusters of homogenous short DNA fragments typically hundreds of base 
pairs long. - bridge amplification. The statistical properties of this ran¬ 
dom process and length of the fragments are critical as they affect the 
information that can be subsequently extracted, i.e., density of success¬ 
fully inferred DNA fragment reads. The ensemble of overlapping DNA 
fragment reads are then used to computationally reconstruct the much 
longer target genome sequence, e.g, ranging from hundreds of thousands 
to billions of base pairs. The success of the reconstruction in turn de¬ 
pends on having a sufficiently large ensemble of DNA fragments that 
are sufficiently long. In this paper using stochastic geometry we model 
and optimize the end-to-end process linking and partially controlling the 
statistics of the physical processes to the success of the computational 
step. This provides, for the first time, a framework capturing salient 
features of such sequencing platforms that can be used to study cost, 
performance or sensitivity of the sequencing process. 


1. Introduction 

Rapid and affordable detection of the order of nucleotides in DNA mo¬ 
lecules has become an indispensable research tool in molecular biology. In 
this paper, we consider the most prevalent sequencing technology that relies 
on reversible terminator chemistry [3] and the “shot gun sequencing” strat¬ 
egy (Ml 53 ES] to determine long DNA strands. Our goal is to develop a 
model and an associated mathematical framework that enable optimization 
of the end-to-end cost of DNA sequencing. For this, we draw upon sto¬ 
chastic geometry and queueing-theoretic tools to model and analyze salient 
characteristics of growing DNA clusters on the surface of a sequencing chip 
and optimize the process of sequence assembly from the short reads provided 
by the sequencing device. These developments provide a systematic basis 
to study the tradeoffs and maximize the cost efficiency of the sequencing 
procedure. 

1.1. High-level Description of the Problem. A target DNA strand to 
be sequenced can be viewed as a possibly long, e.g. 10®, sequence of L 
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letters. In shot gun sequencing a large number of copies of the target are 
first randomly cut into fragments] the fragments are then sequenced, each 
providing a read of length I, where I is much smaller than L [SHIES]. The 
approach involves two key steps. In Step 1, one reads as many fragments as 
possible - as we elaborate later in this section, that can be parallelized and 
therefore performed very efficiently. In Step 2 one attempts to reconstruct 
the target sequence by leveraging the library of overlapping reads obtained 
in Step 1 - this step is referred to as assembly. For clarity let us consider 
the two steps independently, although one should keep in mind that they 
are intimately linked and will, in the sequel, be jointly optimized. 

To functionalize the surface of a DNA chip (referred to as a flow cell), 
DNA fragments are first scattered across its surface whereupon they at¬ 
tach at random locations |1|. A single fragment is insufficient to generate 
a signal that is detectable; to remedy this, each of the initially positioned 
fragments which we refer to as germs are replicated in parallel a number of 
times through the process called bridge amplification. The resulting ensem¬ 
bles of fragments, each comprising hundreds of identical copies of a germ, 
enable signal amplification and accurate DNA sequence detection. The germ 
replication can be viewed as a spatial branching process happening on the 
surface of the flow cell. The result of each such process, in the simplest case, 
is roughly a disc of the fragment’s copies centered at the location where the 
original strand (germ) happened to attach to the flow cell. The radius of 
the disc can be controlled by the number of steps of the bridge amplification 
process. As we shall see, due to possible interaction amongst such growth 
patterns, the resulting shapes might be more complex than discs, so in the 
sequel we will more generally refer to them as clusters. 

As mentioned above, each fragment is replicated to generate a cluster 
of identical molecules which enables signal amplification and thus facilitates 
sequence detection, i.e., the underlying letter reading mechanism. Reads are 
obtained in parallel, i.e., all clusters are read simultaneously one letter at a 
time. This is accomplished by relying on reversible terminator chemistry (3] 
where the first unread letter of each fragment is identified by detecting the 
color of the fluorescent label attached to the nucleotide bound to itQ Since 
clusters contain multiple copies of the fragments, with proper illumination 
each cluster will light up the color indicative of the latest letter/base being 
examined. This process, referred to as sequencing-by-synthesis, is applied 
sequentially for say I steps and thus in principle one can determine the first 
I letters of all fragments on the flow cell - this is the aim of Step 1. 


^More precisely, a mixture of free nucleotides each labeled with one out of four possible 
fluorescent tags is added to the flow cell; a complementary nucleotide attaches to the 
topmost unpaired base (unread letter) in each DNA fragment of the flow cell. The free 
nucleotides are modihed in such a way that once incorporated they terminate the sequence 
extension, i.e., a modified nucleotide incorporated into a strand prevents further strand 
extension. Once the fluorescent tag is identified, the nucleotide’s modification (i.e., its 
terminating property) is “reversed” and incorporation may proceed further. 
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There are two caveats however. First, the successive reading of nucleotides 
can fail for some fragments in each cluster. Specifically, on a given step, say 
k, the chemical processes associated with reading the feth nucleotide base 
may fail or jump ahead to the (fc+l)st one. Thus at each step only a fraction 
of the fragments in a cluster have their A:th nucleotide properly marked with 
the correct fluorescent marker. As this proceeds, an increasing number of 
markers get out of phase, i.e., the disc/cluster will eventually appear to 
have a mix of colors, making correct detection of the fragment’s next letter 
increasingly difficult. To deal with this phenomenon, typically referred to 
as phasing, a number of base calling methods have been proposed in recent 
years la Ea la na 0 Ej. Indeed, amplifying the signal is the reason for 
synthesizing the cluster of duplicates of each fragment in the first place, i.e., 
clusters are meant to enable in-phase addition of light emanating from a 
number of identical fluorescent tags. 

The second caveat is that randomly placed germs may be grown into 
clusters that overlap which will also impair the reading process. Larger 
clusters will tend to experience more overlaps. In essence, this is a random 
disc packing problem: if two germs happen to attach to the flow cell at 
distance r from each other, any cluster growth that leads to discs of radius 
larger than r/2 leads to such an overlap and hence to an impaired reading 
of the letters of the corresponding two strands. 

We are now in a position to articulate the main tradeoffs that drive the 
efficiency of shotgun sequencing which assembles the target using short reads 
from a flow cell. There are three main parameters at play: the density A of 
fragments initially placed on the flow cell, the length I of the fragment reads, 
and the disc radius r associated with the bridge amplification process: 

• A large looks desirable (because more fragment reads will facilitate 
Step 2) but could be problematic for any fixed r because of possible 
disc overlaps; 

• I large is desirable (because it will facilitate Step 2) but could be 
problematic because of the deterioration in reads quality as base 
pair reads get out of phase; 

• r large is necessary (to amplify signals and facilitate detection) but 
this precludes the desire for a large A (because of disc overlaps). 

We pose the following question: what is the optimal set of parameters to 
maximize the “yield” and/or possibly minimize the cost in the presence of 
all these tradeoffs? 

There are many ways to frame the problem of optimizing yield in such 
systems. In this paper we will first consider optimizing fragment yield, which 
we define as the density of length I fragments successfully read per unit area 
of the flow cell. Then we consider metrics that are more directly tied to the 
final objective of sequencing length L target DNA sequence. To that end 
let us consider Step 2, the reassembly process. 
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Following a simple, stylized model of the process, a condition for success¬ 
ful reassembly is that the collection of successfully read fragments covers the 
target DNA - i.e., if one denotes by ai, • • • , the target DNA sequence, 
then in the set of correctly read fragments there should exist a subset of 
fragments, say si,..., such that si contains the sequence oi,... a^j, S 2 
contains the sequence a^j+i,... 0 ^ 3 , • • •, and so on, until Sk contains the 
sequence apj,_^+i,... ol, for some pi, ■ ■ ■ ,Pk-i- This stylized model for re¬ 
assembly is highly simplified. In practice we may encounter two scenarios, 
de novo and reference-based sequencing. De novo sequencing [26l [TOl [l2] 
refers to sequencing genomes which have not been previously characterized, 
while reference-based sequencing [ 8 ] refers to the sequencing task where one 
or more previously sequenced references which are similar to the target are 
available. These present different challenges in the reassembly process, i.e., 
determining where fragments should be placed to reconstruct the target (se¬ 
quence alignment has attracted considerable amount of attention, e.g., see 
[m iisi [la i 2 oi mi [27! and the references therein). Moreover, reassembly 
requires additional slackness in the cover, i.e., in Si, there should be enough 
letters on the left of ap._^+i and on the right of Op. to correctly reconstruct 
the long sequence from the fragments exactly as in a puzzle. Assuming the 
ability to appropriately map/align fragments, the existence of a covering 
is a minimal requirement to sequence the target, see [3 mi and references 
therein for modern discussion of this problem. 

A critical tradeoff associated with Step 2 now emerges. It is between the 
number of correctly read fragments and their length. A large collection of 
fragments is helpful, but if they are too short, i.e., I is small, it is difficult 
to obtain a covering of the entire DNA target sequenc^ 

This brings us to the main problem addressed (and solved) in this paper. 
We set as our goal the determination of the parameters associated with the 
sequencing process as described earlier (namely the density A of fragments 
placed on the flow cell, the duration of the growth process, and the length 
of the fragment reads 1) which ensure a pre-specified probability, say 6, e.g, 
99%, of obtaining a covering of the target DNA sequence at minimal oper¬ 
ation cost. Operational costs can be organized in two main categories: (a) 
those associated with raw materials, e.g., DNA copies, reagents, flow cells; 
(b) those associated with time, e.g., the time spend on the sequencing ma¬ 
chine or the execution time of the signal processing algorithms. In this paper 
we shall for simplicity adopt the flow cell area as the cost. Some reagent 
costs and some flow cell processing time costs might be proportionaj^ to the 
flow cell area. We find it important to stress that our general mathemat¬ 
ical framework can also accommodate other costs. For instance costs that 

^Additionally, in the reference-guided assembly scenario, longer fragments are easier to 
align with the target. 

O 

In this discussion, when we say proportional, we mean proportional through multi¬ 
plicative factors that do not depend on the optimization variables, namely on the length 
of the strands, the density of the strands on the flow cell, or the duplication factor. 
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are proportional to the total number of strands (this might be the case for 
certain processing times) or to the number of DNA copies (raw material), 
rather than to the area. We will not discuss them here for the sake of brevity. 

1.2. Related Work. To our knowledge, this is the first work attempting 
to model and analyze the cluster growth process with a view on optimiz¬ 
ing DNA sequencing cost/yield. The detailed simulations of the surface 
physics associated with the bridge amplification process, [231 E2]) support 
that the disc/cluster processes we introduced earlier and will use are well 
suited. Work optimizing this process has taken place in industry where 
empirical evidence and simple rules of thumb have been used. There is, 
however, a substantial body of work towards developing mathematical tools 
for analyzing random spatial processes (see, e.g., [28] and some of our work 
m)- Indeed, this branch of mathematics is now ubiquitous with applications 
in material science, cosmology, life sciences, information theory, to name a 
few. Further developments of the mathematical foundations have recently 
been carried out by us in [3], and proven to be invaluable, e.g., to under¬ 
stand fundamental characteristics of large wireless systems and optimizing 
their performance. Such stochastic geometry models have been embraced 
by academia and industry, providing insight into current and future techno¬ 
logical developments. Indeed the aim of this paper is to show that this may 
also play a role in the DNA sequencing setting. 

As mentioned earlier in this section, sequence assembly may be performed 
with or without referring to a previously determined sequence (genome, 
transcriptome). De novo genome shotgun assembly is a computationally 
challenging task due to the presence of perfect repeat regions in the target 
and by limited lengths and accuracy of the reads HaiHlEs]. In the reference- 
guided assembly setting, the reads are first aligned (i.e., mapped) to the 
reference, easing some of the difficulties faced by the de novo assembly |8]. 
However, the reference often contains errors and gaps, creating a different 
set of challenges and problems. In fact, if the sample is highly divergent 
from the reference or if the reference is missing large regions, it may even be 
preferable to use de novo assembly HU. While the development of methods 
for sequence assembly received significant attention, ultimate limits of their 
performance have been less explored. The pioneering work of |16j provided 
the first simple mathematical model for the reassembly process. This has 
been followed with various refinements O US] but to our knowledge none 
has provided a mathematical framework to compute the flow cell parameters 
needed to achieve a given likelihood of full target coverage. Moreover, no 
previous work has linked this end objective to the optimization of the end- 
to-end process as we will do in this paper. 

1.3. Organization of the paper. In Section we propose a stochastic 
geometric model for the distribution of clusters on the flow cell resulting 
from the bridge amplification process. The proposed model enables analysis 
of the impact of the geometry of clusters on the achievable yield of fragment 
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reads. Fragment read yield optimization is considered in Section In 
Section]^ we propose a simplified model for the reassembly problem which 
is related to a queueing model and analysis. This allows us to consider the 
end-to-end cost optimization of the sequencing process to meet a desired 
likelihood of coverage for the target DNA sequence which is carried out in 
Section HI 

2. Stochastic geometric models for shotgun DNA sequencing 

In this section, we introduce basic geometric models for the cluster pro¬ 
cesses associated with the DNA fragments resulting from bridge amplifica¬ 
tion procedure on the surface of the flow cell. These are the singleton cluster, 
shot-noise and the Voronoi models, respectively. These processes will be 
tied to the salient features of fragment reading mechanisms. 

In the singleton cluster process model, all clusters that intersect (or touch) 
another cluster are discarded. The retained clusters are roughly modeled as 
discs of radius r consisting of duplicates of the same DNA fragment. In the 
shot-noise process model an attempt is made to read each cluster. Isolated 
clusters are as in the singleton cluster case. A cluster which is in contact 
with one or more clusters is still analyzed as a disc of radius r; however, 
depending on the number and shape of the other clusters in contact, part 
of the light signal stemming from that disc creates an interference which is 
treated as noise. If signal dominates interference/noise, one can still read this 
cluster. Finally, the Voronoi case studies the (hypothetical and somewhat 
futuristic) scenario where one computes optimal masks that allow one to 
mask all clusters that are in contact with the tagged cluster and hence to 
cancel interference. 

Some of these models will be used in Section for the reassembly op¬ 
timization alluded to above. For each case, we describe the mathematical 
approach used to evaluate its performance. This will also be used in order 
to give some yield optimizations of independent interest in Section 

2.1. Random seed model and growth model. We consider the locations 
of the initial DNA fragments (the centers of the clusters or the seeds) to 
be a homogenous Poisson point process N on M^, with intensity A. This 
parameter is simply the number of seeds per unit area. Growth of clusters 
is assumed to be radially homogenous, so if a cluster does not come into 
contact with any other cluster over a time r of growth, it will form a disc of 
radius r. The amplification process creates up to 1000 copies of the initial 
fragment in a disc of radius .5 microns pp. This gives us a density a of 
fragments per square micron. 

If it does come into contact with another cluster, we assume the growth 
in that direction stops, but growth continues in all other directions. As r 
approaches infinity, the configuration of clusters becomes the Voronoi dia¬ 
gram associated with the point process N |28j . For r finite, the shapes are 
known as the Johnson-Mehl growth model [28j . 
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2.2. Read reliability model. As already explained, phasing problems oc¬ 
cur because nucleotides occasionally fail to incorporate in particular dupli¬ 
cates or anneal to the base pair right next in line. These duplicates are then 
out of sync with the rest of the duplicates and give off a different color sig¬ 
nal. So even though amplifying the fragments gives a much stronger signal, 
there is noise due to these out-of-phase duplicates limiting the accuracy of 
the reads. 

To model this in a single cluster, let Xi denote the number of copies of 
the original DNA fragment that remain in phase after I steps of the process. 
Let p be the probability that a DNA fragment gets out of phase at one step. 

The random variable Xi has a binomial distribution, where the number 
of trials is the number of DNA fragment copies in a cluster, and (1 — pY 
is the probability that a fragment remains in phase, after I steps, i.e. the 
probability of success. 

At first glance, it makes sense to require that p{l) = {1 — pY > ^ (so 

that on average, more than half the duplicates remain in phase) to have a 

correct read. However, this does not capture certain phenomena, e.g. the 

fact that as the radius becomes very small, fluctuations are high around the 

mean. We will hence require that the number of duplicates remaining in 

phase is above half by some positive margin. The probability of a correct 

2 

read will then rather be P(A; > -|-e) for some positive epsilon, which is 

an important complementary parameter of our model. The value of epsilon 
used here is 10, since the output yields using this value are on the same 
order as the density of clusters achieved by Illumina technology |T]. 

2.3. The singleton cluster model. For a cluster with unimpeded growth 
over time r, the number of DNA fragments in a cluster is avrr^, so Xi ~ 
B{aTTr‘^,p{l)). 

When the number of trials is large, the binomial distribution can be ap¬ 
proximated by the normal distribution. In the singleton case: 

Xi ~ M{a'Kr‘^p{l), \/a'Kr‘^p{l){l —p{l))). 

A cluster centered at y in Ai is isolated after time r if the ball centered 
at y with radius 2r does not contain any other point of N. A homogenous 
Poisson point process is stationary, so we can consider a typical ball centered 
at 0. Given intensity A and radius r, we can then calculate the intensity of 
isolated clusters Aj(A,r). By Slivnyak’s theorem, 

Ai(A,r) = AP°(iV'(R(0,2r)) = 0) 

= AP(N(R(0,2r)) = 0) 

The overall fragment yield of singletons is the intensity of the isolated 
clusters times the probability a cluster will enjoy a correct read: 

Xy { X , r, 1 ) = Ae-4^-"'P(W > ^ + e). 
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2.4. The shot-noise model. Using only isolated clusters is clearly subop- 
timal. We consider here the situation where all clusters are used. In this 
case, for each cluster, the amount of interference from contact with other 
clusters during the growth/duplication process has to be taken into account. 
According to our growth assumptions, the area of the typical cluster (as¬ 
sumed with a seed located at 0) is |Vb n B{0,r)\, where | • | is Lebesgue 
measure, Vq is the Voronoi cell of the point at 0, and B{x,r) the ball of 
center x and radius r. Here, we use a lower bound for the area that is easier 
to calculate. The interference encountered from another cluster centered at 
Xi € N is considered to be half of the area of the overlap between the discs 
B{0,r) and B{xi,r). We take the total interference for the typical cluster 
to be the sum of these areas over all surrounding clusters. This is in fact an 
upper bound on the actual interference, e.g. triple intersections are counted 
twice (see Figure [^. 

This interference upper bound can be 
described in terms of a shot-noise field 
/tv defined on as a functional of 
our Poisson point process N with re¬ 
sponse function ar{x) = ^\B{0,r) H 
B{x, r)|. For hxed r, ar depends only on 
the distance ||x||, so we write ar(||x||). 

The total interference is / = In = 

/r 2 a,.(||a;||)A^(cix). 

The Laplace functional of the interfer¬ 
ence is: 



Li{s) = e ir2 ® Figure 1. Cluster interference. 

Overlap with the typical cluster B{0,r) 

only occurs for clusters with centers contained in the ball of radius 2r around 
0, so we consider only the interference on B{0,2r). Switching to polar 
coordinates, the Laplace transform becomes 


Li{s) 


r2r 

exp(—27rA / (1 

Jo 




where 

ar{v) = cos-1 

The random variable Xi (the number of copies of the original DNA fragment 
that remain in phase after I steps) is now binomial with the number of trials 
depending on the interference I. Since the interference comes from fragments 
that are not copies of the seed of the tagged cluster, this area contains no 
potential in-phase fragments. The area occupied by fragments of interest is 
Trr^ — I and thus, 

Xi{I) ~ Bin{a{7rr‘^ — I),p{l)), where p{l) = (1 — pY- 
The cluster is still analyzed as a disk of radius r, so the number of in-phase 
DNA fragments needed for a correct read remains at least ^avrr^ -|- e. The 
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probability of a correct reading given interference / = x is 
g{x) = P{Xi{x) > ^ovrr^ + e\I = x) 
and the fragment yield is 

Xy{r,X,l) = X j g{x)f{dx), 

where / is the law of I, which is known through its Laplace transform 
[28] , The computation of this yield, which is based on Fourier techniques, 
is discussed in Appendix | 6 . 1 [ 

2.5. The Voronoi model. In this subsection, we consider an optimal sce¬ 
nario for collecting signal reads using our assumptions about cluster growth. 
Clusters are allowed to grow until they have formed a Voronoi tessellation. 
Then optimal sized masks having the shape of each Voronoi cell are used to 
read the signal. In this scenario no interference from neighboring clusters is 
present, and the only variable to optimize is the intensity A of the underlying 
point process. 

The closed form of the distribution for the area of Voronoi cells is un¬ 
known, but it can be approximated by the generalized Gamma density: 

fxA{x) = exp(-xx'^) for x > 0. 

This is the approximate distribution for the normalized cell size XA, where 
A denotes the area. For A = 1, good choices for the area are: 7 = 1.08, 
u = 3.31, and y = 3.03 [SS]. 

For a general intensity A the area distribution is given by 

/a(x) = A/aa(Ax) = A^^^(Ax)''-^exp(-x(Ax)‘') for x > 0, 

where 7 = 1.08, u = 3.31, and x = 3.03. 

The expected fragment yield for the Voronoi case is then 
A^(A, l) = X P{Xi ><f+e\A = x)fA{x)dx. 

3. Fragment and letter yield 

In this section, we first study the fragment yield, namely the mean number 
of correctly read fragments per unit space of the flow cell. We then study 
the letter yield, which is based on the number of letters correctly read. 

3.1. Numerical results on fragment yield. Below, we use the mathe¬ 
matical expressions obtained above to optimize the yield in all three models 
for I fixed. For the singleton cluster and the shot-noise model, we optimize 
over the radius r and the intensity A. For the Voronoi model, the optimiza¬ 
tion is over A. 

Table 1 and Table 2 show optimal parameters and fragment yields for 
I = 200 and I = 150. 

For I = 200, a 45.7% increase in optimal fragment yield can be obtained 
when considering all clusters and not just the singleton ones. This increase 
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Table 1. Optimal Parameters for I = 200 


Value 

Singletons 

Shot-noise 

Voronoi 

A 

1.3981 

1.9143 

3.7478 

r 

.2386 

.2447 


F-Yield (per square micron) 

.2844 

.4145 

2.8422 


in yield comes with a slighter larger radius and higher intensity than the 
optimal parameters for singleton clusters. This corresponds to increasing 
the amount of time for replication and increasing the number of initial DNA 
fragments spread over the flow cell. This will mean more clusters overlap 
with their neighbors, but many more correct reads can still be made for 
clusters that only run into their neighbors late in the growth stage. 

Table 2. Optimal Parameters for I = 150 


Value 

Singletons 

Shot-noise 

Voronoi 

A 

3.2625 

4.6031 

8.6723 

r 

.1562 

.1686 


F-Yield (per micron^) 

.9146 

1.5704 

8.6108 


For each case, decreasing I to 150 from 200 resulted in a smaller opti¬ 
mal radius and a larger optimal intensity. With a smaller I, a fragment is 
more likely to remain in-phase, making it is easier for in-phase fragments to 
comprise half the cluster plus the hxed margin e. This allows clusters to be 
smaller and more densely packed to obtain a higher yield. 

The percent increases in the fragment yield obtained when switching to 
I = 150 are 

• Singletons: 221.59% 

• Interference: 278.87% 

• Voronoi: 202.96%. 

3.2. Numerical results on letter yield. In view of the differences be¬ 
tween I = 200 and I = 150, it makes sense to consider the optimal letter 
yield, namely lXy { l , r , X ), the mean number of letters correctly deciphered 
per unit space. So, below, optimization takes place w.r.t. I as well. 

We see that the optimizing value of I is actually somewhere around 100, 
which is shorter than the read length provided by, e.g., Illumina’s HiSeq 
sequencing platforms. 

The percent increases in the fragment yield obtained when switching from 
I = 150 to the optimal I are 

• Singletons: 31.07% 

• Interference: 49.14% 

• Voronoi: 18.89%. 
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Table 3. Optimal Parameters for Letter Yield 


Value 

Singletons 

Shot-noise 

Voronoi 

A 

6.2320 

9.5769 

12.1160 

r 

.1130 

.1233 


1 

91.4570 

86.1716 

119.8302 

L-Yield (per micron^) 

179.8272 

344.2525 

1535.7 


In this optimized setting, the letter yield of the shot-noise case is close to 
twice that of the singleton cluster case. 


4. Reassembly model and optimization 

This section is focused on the optimization of the probability of reassembly 
of the original DNA sequence. 


4.1. Reassembly model. The reassembly question can be formulated in 
terms of 

• n the number of fragments in the genomic library (fragments cor¬ 
rectly deciphered); 

• L the DNA sequence length in base pairs (letters); for the human 
genome, L = 3 billion; 

• I the length of the fragments in the same unit. 

We see L as a segment on the real line and we assume that the fragment 
starting points form a Poisson point process on the real line with parameter 

A = f- 

As already explained we are interested in the probability of complete 
reassembly. We will first reduce this to the probability that all letters are 
covered. 


4.2. Analytic expression for the reassembly probability. Within the 
Poisson setting described above, this can be reduced to a queuing theory 
question: consider an M/D/oo queue, namely a queue with Poisson arrivals, 
an infinite number of servers and a constant service time of length 1. The 
probability of reassembly is then just the probability that the busy period 
in such a queue exceeds L. The distribution of the busy period can be 
determined through its Laplace transform. 
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Let be a Poisson point process on M with parameter A. Let 
be the following random sequence of times: 


To — 0 , 

Ti = 

n = 


l, ifiV(0,0 = 0 

max X, ifA^(0,/)>0’ 


X£N, X<1 

Tk-i + 

max Xi 
xGN, x<Tk-i+l 


iiN{Tk-i,Tk-i+l) = 0 

, if7V(rfc_i,rfc_i + o>0’ 


The busy period of the queue is T = Tk, defined by 

I, if K = 1 

l + ifK>l’ 

where K is the random variable K = min{n G IN : T„ = Z} and the are 
i.i.d random variables with distribution equal to that of the last Poisson 
arrival in the interval (0,Z). We claim that the are equal in distribution 
to I — {r]\ri < 1), where rj ~ exp(A). Indeed, the distribution of the distance 
of the first point in a Poisson point process on M from 0 is the same looking 
forward and looking backward. By this and translation invariance, 

I — min lx 

x^N:x^{0^l) 


id) j . I 

= I — mm \xi 
x^N:x^{—l,0) 


I 7 

= max X +1 

xGN:xG{-1,0) 

id) 

= max X. 
x&N:x&{l,0) 



Now, let rj =^~ exp(A). The Laplace Transform of (lyl?? < 1) is 




P(r/ < 1) 
1 


1 _ g Ai 
1 


1 — e 


-XI 


Ag-(.+A)x^^ 

A 


s T A 


(1 - 
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Now, K = 1, then Tk = I- So, E[e = 1] = e For /c > 1, 


fc-i 


= e 


^—sl 




Z =1 


A 


k-l 


= e 


„ — sl 


X — s 


^-sl _ g-A/-'^ fc-1 


1 _ g Ai 


Since iF is a geometric random variable with success probability!? = P(A^(0, 1) 
0) = the Laplace transform of the busy period T is 

^t(s) = E[e""^] 


^E[e"®^^|iF = k]P{K = k) 


k=l 

oo 

= E« 

k=l 


—si 


( ^ 

1 - 

< 

1 

1 

cr> 

1 

\X-s 

1 _ g AZ 


k-l 


p{l-p) 


k-l 




E 

k=l 


A 

A — s - 


g si _ ^ \i 


k-l 


I _ l A^^g_si _ g_AZ 


e-(^+^)^(A — s) 

X — s — A(e“®^ — 6“-^^) ’ 

We want to compute P(T > L), so we invert the Laplace transform of the 
CDF, which is easily calculated from the Laplace transform of the density: 

1 


Cft{s) = -'I't(s) 


The inversion is done numerically using the Euler Inversion method [2]. 


4.3. Reassembly optimization. In our optimization, we ask the following 
question: For a given genome length, what is the smallest area the sequenc¬ 
ing flow cell needs to have in order to get a high probability that the entire 
genome can be reassembled? 

To answer this, let A be the area (in square microns) of the flow cell 
where fragments are replicated and sequenced. Then, n = AXi, where A; 
is the optimal yield per square micron for fragments of length I as derived 
in Section]^ Assume L is given. Then consider the M/D/oo queue with 
arrival rate and service time 1. 
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Figure 2 



Fragment Length 



Table 4. Optimal Parameters needed for P(T > 100,000) > .99. 



Singletons 

Shot-Noise 

Optimal 1 

97 

92 

Minimum A (square microns) 

6,483 

3,088 


For each I, we find the minimum A such that the probability of reassembly, 
P(T > L), is greater than some threshold. Finally, we optimize over I to find 
the smallest required area. The optimal I is the fragment length that requires 
the least flow cell area to obtain the desired probability of reassembly. 

Figure shows the minimum area needed to achieve a probability of 
reassembly of .99 versus the length of the fragments for a genome of length 
100 , 000 . 
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5. Conclusion 

This paper establishes a connection, which is new to the best of our knowl¬ 
edge, between stochastic geometry and queuing theory on one side, and fast 
DNA sequencing on the other side. This connection allowed us to propose a 
simple model which captures the essential steps of the fast sequencing pro¬ 
cess: segmentation of multiple copies of the DNA into random fragments, 
replication of the randomly placed fragments on the flow cell, spatial inter¬ 
actions between the resulting fragment clusters, read of fragments through 
their cluster amplification, taking the possibility of read errors and interfer¬ 
ence between clusters into account, and finally assembly of the successfully 
read fragments. This model is analytically tractable, which allowed us to 
quantify and optimize various notions of yield, including the yield of the 
end to end sequencing process, in function of the key parameters. This 
basic model seems generic and flexible enough for us to envision a series 
of increasingly realistic and yet tractable variants for each step of the pro¬ 
cess and eventually a comprehensive quantitative theory for this class of 
sequencing problems. 


6. Appendix 

6.1. Numerical method for the shot noise model. We discuss here 
the numerical evaluation of the probability of a successful read in a cluster 
in the shot-noise model: E[( 5 r(/)]. For this we need the probability density 
of I. This random variable has a mass at 0 and a continuous part with 
support on M+. The mass at 0 is P(/ = 0) = P(i?(0,2r)) = 0) > 0. For 
the continuous part, the interference must be conditioned on having a point 
in the ball i?(0,2r), since this will ensure a non-zero interference. Letting 
Po = 1P(-^ = 0) = P(N(P(0, 2r)) = 0), the expected yield is 

Xy{X,r,l) = X'Eigil)) = X {Pog{0) + {I - Po) g{x)fi\N{B{o, 2 r))>o{x)dx). 

In order to compute the integral fg g(x)ffjjY^Q(x)dx, we need to approx¬ 
imate //|Ar>o- The Fourier transform of the density can be calculated from 
the Laplace transform: 

A|^>o( 5) = E(e-“V(P(0,2r)) > 0) 

= —- E(e"“-^|iV = 0)P(N = 0)) . 

1 — Pq 

Then, //|Ar>o(s) = ~ the mean value of interest 

is 

Pb5'(0) -k /o“ g{x)p-^ (^fiis) - Po) dx . 

The inverse Fourier transform is approximated using the inverse fast Fourier 
transform {ifft in MATLAB). The total integral is approximated using the 
trapezoidal rule. 
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